Abstract

Background: Assessment of the potential allergenicity of proteins is necessary whenever transgenic proteins are
introduced into the food chain. Bioinformatics approaches to allergen prediction have become appreciably more
sophisticated and accurate in recent years. However, which features are critical for a protein's allergenicity
has not yet been fully investigated.

Results: We present a more comprehensive model for predicting allergenic proteins in a 128-feature space that
integrates various protein properties, such as biochemical and physicochemical properties, sequential features
and subcellular locations. The overall accuracy in cross-validation reached 93.42% to 100% with our new
method. The Maximum Relevance Minimum Redundancy (mRMR) method and the Incremental Feature Selection (IFS)
procedure were applied to identify which features are essential for allergenicity. Performance comparisons
showed the superiority of our method over widely used existing methods. More importantly, the features of
subcellular location and amino acid composition were observed to play major roles in determining the
allergenicity of proteins, particularly the extracellular/cell surface and vacuole locations for wheat
and soybean. To facilitate allergen prediction, we implemented our computational method in a web
application, which is available at http://gmobl.sjtu.edu.cn/PREAL/index.php.

Conclusions: Our new approach could improve the accuracy of allergen prediction, and the findings may provide
novel insights into the mechanism of allergies.



Background 

Allergens are substances that can induce type I hypersensitivity reactions in atopic individuals, mediated by
Immunoglobulin E (IgE) responses [1-4], which are seriously harmful to human health. For instance, allergenic
proteins in food and other hypersensitivity reactions are major causes of chronic ill health in affluent
industrial nations, mostly against milk, eggs, peanuts, soy, or wheat, affecting up to 8% of infants and
young children [5-7]. Moreover, the introduction of genetically modified foods and newly modified proteins is
also increasing the risk of food allergy in susceptible individuals [8,9]. Consequently, assessing the
potential allergenicity of proteins is essential to prevent the inadvertent generation of new allergenic
foods by agricultural biotechnology.

In 2001, the World Health Organization (WHO) and the Food and Agriculture Organization (FAO) proposed
guidelines to assess the potential allergenicity of a protein, an important part of which is the use of
bioinformatic methods to determine whether the primary structure (amino acid sequence) of a given protein is
sufficiently similar to sequences of known allergenic proteins [10,11]. Under the FAO/WHO rules, a protein is
identified as a putative allergen if it has at least six contiguous amino acids matched exactly (rule 1) or a
minimum of 35% sequence similarity over a window of 80 amino acids (rule 2) when compared with known
allergens. Several studies have shown that the FAO/WHO bioinformatic rules produce many false positives in
allergen prediction [12-19]. Since then, a number of other computational prediction methods based on
comparing protein structure or sequence similarity with known allergens have been reported [18,20-26]. For
example, in 2003 a new approach that identified motifs from known allergens raised the precision from 37.6%
to 94.8% [18]. The statistical learning method SVM (support vector machine) has been used for predicting
allergens since 2006, and the input features of most SVM-based prediction approaches were composed of either
amino acid composition or pairwise sequence similarity scores with known allergens [20-24,27]. Furthermore,
approaches that identify epitopes, allergen-representative peptides or family-featured peptides have also
been applied to allergen prediction [20,25,26]. However, the use of these methods has been limited because
very few epitopes and allergen-representative peptides are known so far.
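The two FAO/WHO rules described above can be illustrated with a minimal sketch. It is not the procedure used by any of the cited tools: rule 2 is approximated here by an ungapped sliding-window comparison, whereas real implementations normally rely on an alignment program, and the function names are hypothetical.

```python
def shares_exact_kmer(query: str, allergen: str, k: int = 6) -> bool:
    """Rule 1: True if the two sequences share an identical run of k residues."""
    if len(query) < k or len(allergen) < k:
        return False
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in allergen_kmers
               for i in range(len(query) - k + 1))


def max_window_identity(query: str, allergen: str, window: int = 80) -> float:
    """Simplified rule 2 (assumption: ungapped comparison, no alignment).

    Returns the best fraction of identical positions between any
    window-long segment of the query and any window-long segment of
    the allergen; 0.0 if either sequence is shorter than the window.
    """
    best = 0.0
    for q_start in range(len(query) - window + 1):
        segment = query[q_start:q_start + window]
        for a_start in range(len(allergen) - window + 1):
            target = allergen[a_start:a_start + window]
            matches = sum(a == b for a, b in zip(segment, target))
            best = max(best, matches / window)
    return best


def is_putative_allergen(query, allergens, k=6, window=80, min_identity=0.35):
    """Flag the query if either rule fires against any known allergen."""
    return any(
        shares_exact_kmer(query, a, k)
        or max_window_identity(query, a, window) >= min_identity
        for a in allergens
    )
```

Because a single shared hexapeptide is enough to trigger rule 1, short chance matches are easy to obtain, which is consistent with the false positives reported for these rules.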

Our previous study showed that, although the FAO/WHO criteria have a higher sensitivity and the motif-based
approach can give a graphical view of the key allergenic motifs, the SVM-based method is superior to the
others in accuracy of allergen prediction and in processing time [28]. As described above, a variety of
bioinformatic methods for predicting allergens have been reported, but most of these approaches depend only
on the similarity of protein sequences or primary sequential properties between the query protein and known
allergens. Here, in addition to protein sequential features, we developed an improved model for identifying
potential protein allergenicity using 128 features covering biochemical and physicochemical properties and
subcellular locations. All features were then ranked using the mRMR (maximum relevance and minimum
redundancy) method, and an optimal model was rebuilt and evaluated with ten-fold cross-validation. Finally,
we present a web-based application with a friendly interface that allows users to submit individual or batch
predictions for a query protein or protein list using our new method.

Methods 

Datasets 

A total of 1176 distinct allergen proteins were collected from the Swiss-Prot Allergen Index, the IUIS
Allergen Nomenclature, SDAP [26] and ADFS [29], and were used as the positive dataset. To build a reliable
negative dataset, we integrated previously reported methods [13,18,22] and processed the data as follows:
(1) 522,019 protein entries were downloaded from Swiss-Prot (Swiss-Prot Release 2010_11 of 02-Nov-10);
(2) entries with sequence identity >= 30% to any known allergen were removed; (3) all sequences shorter than
50 amino acids were also discarded; (4) in each of the following cross-validation evaluations, the same
number of negative samples as positive samples was selected randomly from the remaining entries.
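The filtering steps for the negative dataset can be sketched as below. This is only an illustration under stated assumptions: the per-protein maximum identity to any known allergen is assumed to have been computed beforehand (for example from BLAST output), and the function and argument names are hypothetical.

```python
import random


def build_negative_set(candidates, max_identity_to_allergen, n_negatives,
                       min_length=50, identity_cutoff=30.0, seed=0):
    """Pick random non-allergen entries after the length and identity filters.

    candidates: dict mapping protein id -> amino acid sequence.
    max_identity_to_allergen: dict mapping protein id -> highest percent
        identity to any known allergen (assumed precomputed; ids without
        an entry are treated as having no hit).
    n_negatives: number of negatives to draw, e.g. the number of positives.
    """
    eligible = sorted(
        pid for pid, seq in candidates.items()
        if len(seq) >= min_length
        and max_identity_to_allergen.get(pid, 0.0) < identity_cutoff
    )
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(eligible, n_negatives)
```

Drawing a fresh random sample per evaluation, as step (4) describes, would correspond to calling the function with a different `seed` each time.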

Software 

NCBI-BLAST (version 2.2.23) was used to measure the similarity between sequences [30]. SSpro/ACCpro 4.03
[31,32], for predicting the secondary structure and solvent accessibility of proteins, was obtained from
http://download.igb.uci.edu/. To assess a protein as an allergen or non-allergen, the SVM method was
implemented using the LIBSVM software v3.0 [33], from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. The mRMR
program [34], from http://penglab.janelia.org/proj/mRMR/, was used for feature ranking and selection. A Perl
script was written for protein feature extraction and allergenicity prediction. ClustalX2 and Muscle were
used for multiple sequence alignments with the default parameters [35,36]. The NJ (Neighbour-Joining) tree
was constructed from the aligned protein sequences using MEGA (version 5) with the following parameters:
Poisson correction, pairwise deletion, and bootstrap (1,000 replicates; random seed) [37].

Feature vector construction 

(1) Features of biochemistry and physicochemistry 

The following six kinds of biochemical and physicochemical properties were extracted from a given protein
sequence: (1) amino acid composition (AAC), (2) molecular weight (MW), (3) hydrophobicity, (4) polarizability,
(5) normalized van der Waals volume (NWV), and (6) polarity.

AAC is the fraction of each amino acid in a protein [20]. The fraction of each of the 20 natural amino acids
was calculated using Eq. (1):

$$ \text{Fraction of amino acid } i = \frac{\text{total number of amino acid } i}{\text{total number of amino acids in protein}} \qquad (1) $$

where i can be any amino acid.
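As a small illustration of Eq. (1), the 20 AAC features could be computed as follows. This is a sketch, not the authors' Perl script; the ordering of the 20 residues in the output vector is an arbitrary choice made here.

```python
from collections import Counter

# Alphabetical one-letter codes of the 20 natural amino acids (ordering is an
# assumption made for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list:
    """Return the fraction of each of the 20 amino acids in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    # Denominator is the full sequence length, as in Eq. (1).
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]
```

For the toy sequence "AAGC", the entries for A, C and G are 0.5, 0.25 and 0.25, and all other entries are zero.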

Molecular weight was considered in this study because several studies have shown that it is related to
allergen identification [38-42]. AAC and MW reflect global features of a protein. The other four types of
biochemical and physicochemical properties, which relate to single amino acids in a given protein sequence,
were constructed following Huang et al. [43]. Each of these local property types can be classified into three
categories. For instance, for hydrophobicity an amino acid can be grouped as polar, neutral or hydrophobic.
Similarly, the classifications of polarizability, NWV and polarity are summarized in Table 1 [44-46]. Then,
for each property type above, the 20 kinds of residues in the original protein sequence are recoded into the
corresponding three local letters, such as P (polar), N (neutral) and H (hydrophobic). Finally, with the
method developed by Huang et al. [43], the coded sequence is integrated into the corresponding global
features: C (composition), T (transition) and D (distribution). C refers to the global composition of each of
the three groups (3 elements), T is defined as the proportion of transitions between each pair of letters
among the total changes along the entire coded sequence (3 elements), and D expresses the distribution
pattern of the code letters, measured by the positions of the first, 25%, 50%, 75% and 100% of each of the
three letters along the sequence (5*3 = 15 elements). Therefore, each property classified into three
categories generates 21 features (3 + 3 + 15 = 21).

Table 1 The classification of protein properties

Property type      Category       Amino acid
Hydrophobicity     Polar          R, K, E, D, Q, N
                   Neutral        G, A, S, T, P, H, Y
                   Hydrophobic    C, V, L, I, M, F, W
Polarizability     0-0.108        G, A, S, D, T
                   0.128-0.186    C, P, N, V, E, Q, I, L
                   0.219-0.409    K, M, H, F, R, Y, W
NWV(a)             0-2.78         G, A, S, C, T, P, D
                   2.95-4.0       N, V, E, Q, I, L
                   4.43-8.08      M, H, K, F, R, Y, W
Polarity           4.9-6.2        L, I, F, W, C, M, V, Y
                   8.0-9.2        P, A, T, G, S
                   10.4-13.0      H, Q, R, K, N, E, D
SSP(b)             Helix          Predicted by SSpro [31]
                   Strand
                   Coil
Solvent            Buried         Predicted by ACCpro [32]
                   Exposed

(a) normalized van der Waals volume; (b) secondary structure propensity.
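The composition, transition and distribution (C, T, D) encoding can be sketched for the hydrophobicity classes of Table 1 as follows. Several details are assumptions rather than statements about the original implementation: transitions are normalized by the number of adjacent residue pairs (the conventional CTD definition; the text's "total changes" could instead mean the number of positions where the class changes), distribution positions are expressed as fractions of the sequence length, and the occurrence index for each percentage is rounded up.

```python
import math

# Hydrophobicity classes from Table 1: P (polar), N (neutral), H (hydrophobic).
HYDROPHOBICITY_CLASSES = {"P": "RKEDQN", "N": "GASTPHY", "H": "CVLIMFW"}


def recode(sequence: str, classes: dict) -> str:
    """Replace each residue by its class letter; unknown residues are skipped."""
    lookup = {aa: letter for letter, members in classes.items() for aa in members}
    return "".join(lookup[aa] for aa in sequence.upper() if aa in lookup)


def ctd_features(coded: str, letters: str = "PNH") -> list:
    """Return 3 composition + 3 transition + 15 distribution values."""
    n = len(coded)
    if n < 2:
        raise ValueError("coded sequence must contain at least two letters")

    # C: fraction of each class letter.
    composition = [coded.count(x) / n for x in letters]

    # T: fraction of adjacent pairs that switch between two given classes.
    transition = []
    for i in range(len(letters)):
        for j in range(i + 1, len(letters)):
            pair = {letters[i], letters[j]}
            changes = sum(1 for x, y in zip(coded, coded[1:]) if {x, y} == pair)
            transition.append(changes / (n - 1))

    # D: relative position of the first, 25%, 50%, 75% and last occurrence.
    distribution = []
    for x in letters:
        positions = [idx + 1 for idx, c in enumerate(coded) if c == x]
        if not positions:
            distribution.extend([0.0] * 5)
            continue
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(1, math.ceil(frac * len(positions)))  # k-th occurrence
            distribution.append(positions[k - 1] / n)

    return composition + transition + distribution
```

Applying `ctd_features(recode(seq, HYDROPHOBICITY_CLASSES))` to a sequence yields the 21 values for one property; repeating it with the other three class tables would give the 84 local-property features.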

(2) Subcellular location description of proteins 

The protein's subcellular location information was also incorporated into the input features for the SVM,
because it is closely correlated with the function of a protein [47,48]. There were 22 subcellular locations
for eukaryotic proteins collected from UniProt [49]; therefore, we represented the subcellular location
features by a 22-dimensional vector $SL = (sl_1, sl_2, sl_3, \ldots, sl_{22})$, where $sl_i = 1$ means that
the query protein is located at the i-th subcellular location site and $sl_i = 0$ means that the query
protein is not found at the i-th subcellular location site [43]. However, only a minority of proteins have
subcellular location annotations. To solve this issue, we predicted the localization of proteins without
annotation based on their sequence similarity to proteins with known locations. With sequence similarity
evaluated by BLAST [30], the query protein was considered to have the same subcellular locations as a
location-known protein if the BLAST score between them was greater than 120 [43].
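A minimal sketch of the 22-dimensional location vector and the similarity-based transfer of annotations is given below. The inputs are hypothetical: locations are referred to by index 1 to 22, and BLAST hits are assumed to be available as (score, locations) pairs. Taking the union of locations over all hits above the cutoff is also an assumption, since the text does not say whether one hit or several are used.

```python
N_LOCATIONS = 22
BLAST_SCORE_CUTOFF = 120.0


def infer_locations(annotated, blast_hits, cutoff=BLAST_SCORE_CUTOFF):
    """Return the set of location indices for a query protein.

    annotated: set of location indices from the database, or None/empty
        when the protein has no annotation.
    blast_hits: iterable of (score, set_of_location_indices) for
        location-known proteins hit by the query (hypothetical input).
    """
    if annotated:
        return set(annotated)
    inferred = set()
    for score, locations in blast_hits:
        if score > cutoff:  # strictly greater than 120, as in the text
            inferred |= set(locations)
    return inferred


def location_vector(location_indices, n_locations=N_LOCATIONS):
    """Encode location indices (1-based) as a 0/1 vector of fixed length."""
    vector = [0] * n_locations
    for i in location_indices:
        if not 1 <= i <= n_locations:
            raise ValueError(f"location index out of range: {i}")
        vector[i - 1] = 1
    return vector
```

A protein with no annotation and no hit above the cutoff simply receives the all-zero vector in this sketch.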

(3) Feature space 

As mentioned above, hydrophobicity, polarizability, NWV and polarity generated 21 elements each. There were
20 elements for AAC, 1 element for MW and 22 for subcellular locations. In addition, the length of the
protein was counted as a component. Therefore, the total feature space representing a protein sample
contained (21*4 + 20 + 1 + 22 + 1) = 128 components, which are listed in detail in Additional file 1.
Consequently, a protein sample can be formulated as a vector in a 128-D (dimensional) space, i.e.,

$$ V = [v_1, v_2, v_3, \ldots, v_j, \ldots, v_{128}]^{T} \qquad (2) $$

where $v_j$ is the j-th (j = 1, 2, ..., 128) component of the protein.

To enhance the accuracy of the SVM, each of the 128 features in Eq. (2) was scaled by Eq. (3):

$$ v_j' = \frac{v_j - \mu_j}{\sigma_j} \quad (j = 1, 2, \ldots, 128) \qquad (3) $$

where $\mu_j$ is the mean and $\sigma_j$ is the standard deviation of the j-th component over all protein
samples.
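The scaling step amounts to a per-feature standardization, which can be sketched as follows. Two choices here are assumptions: the population standard deviation is used (the text does not say whether the population or sample form was applied), and constant features are mapped to zero to avoid division by zero.

```python
from statistics import mean, pstdev


def standardize_columns(samples):
    """Scale each feature column to zero mean and unit standard deviation.

    samples: list of equal-length feature vectors, one per protein.
    """
    columns = list(zip(*samples))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population SD (assumption)
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in samples
    ]
```

In a cross-validation setting it would be usual to compute the means and standard deviations on the training folds only; the sketch follows the text and uses all samples.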

Feature selection 

(1) mRMR method 

The mRMR method was developed to rank each feature according to its relevance to the target and its
redundancy with other features [34]. The mRMR program was downloaded from
http://penglab.janelia.org/proj/mRMR/
, and run with the parameters: A = L m = MID. 
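The idea behind mRMR ranking with the MID (mutual information difference) criterion can be illustrated by the sketch below. It is not the mRMR program itself: features are assumed to be already discretized, and the greedy loop simply picks, at each step, the feature whose relevance to the class minus its mean redundancy with the already selected features is largest.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mrmr_mid_ranking(features, labels):
    """Rank features greedily by relevance minus mean redundancy (MID).

    features: dict mapping feature name -> list of discrete values.
    labels: list of class labels, same length as each feature list.
    """
    relevance = {name: mutual_information(values, labels)
                 for name, values in features.items()}
    remaining = sorted(features)
    ranked = []
    while remaining:
        def score(name):
            if not ranked:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in ranked) / len(ranked)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

In the toy case of two independent informative features and an exact copy of one of them, the copy is ranked last because its redundancy with the already selected original cancels its relevance.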

(2) Incremental Feature Selection (IFS) 

As mentioned above, the feature components can be ranked using the mRMR method, but the ranking alone does
not reveal which components are most necessary. The IFS method was therefore adopted in this study to
perform feature selection and analyze the key properties related to allergenicity. Based on the ranked
features obtained from mRMR, 128 feature sets were constructed by adding one component to the set at a time
in the order of the mRMR feature list. The i-th set is formed as $S_i = \{f_1, f_2, \ldots, f_i\}$
$(1 \le i \le 128)$, where $f_i$ is the feature at the i-th position after ranking by mRMR.

For each feature set, an SVM predictor was constructed and its ten-fold cross-validation performance was
derived. Eventually, an IFS curve was obtained, with the component number i on the X-axis and the
corresponding sensitivity, specificity and accuracy on the Y-axis. If the IFS curve has an inflection point
at X = h, the feature set that plays a key role in allergenicity is
$S_{optimal} = \{f_1, f_2, \ldots, f_h\}$.
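The IFS loop can be sketched as follows. The `evaluate` callable is hypothetical and stands for whatever returns the cross-validated accuracy of a classifier trained on a given feature subset. The `elbow_point` rule is only one possible way to automate locating the inflection point, which the text reads off the curve; its `min_gain` threshold is an arbitrary assumption.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Evaluate the nested feature sets S_1, S_2, ..., S_n.

    ranked_features: feature names in mRMR order.
    evaluate: callable taking a list of feature names and returning an
        accuracy (hypothetical; e.g. ten-fold cross-validation of an SVM).
    Returns the IFS curve as a list of (i, accuracy) pairs.
    """
    curve = []
    for i in range(1, len(ranked_features) + 1):
        subset = ranked_features[:i]  # S_i = {f_1, ..., f_i}
        curve.append((i, evaluate(subset)))
    return curve


def elbow_point(curve, min_gain=0.005):
    """Return the first i after which adding one feature gains < min_gain."""
    for (i, acc), (_, next_acc) in zip(curve, curve[1:]):
        if next_acc - acc < min_gain:
            return i
    return curve[-1][0]
```

With 128 ranked features this trains and evaluates 128 models, so the cost of `evaluate` dominates the running time.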

Ten fold cross-validation 

The performance of all methods applied in this study was evaluated using ten-fold cross-validation. The
dataset was randomly partitioned into ten subsets, each with a nearly equal number of allergens and
non-allergens (negative controls). Of the ten subsets, a single subset was retained as the validation data
for testing the method, and the remaining nine subsets were used as training data. This process was then
repeated ten times, with each of the ten subsets used exactly once as the validation data. The overall
performance of a method was the average performance over the ten subsets.
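The partitioning and averaging just described can be sketched as below. The `train_and_predict` callable is hypothetical: it stands for training a classifier on the training indices and returning predicted labels for the test indices. The sketch assumes there are at least as many samples as folds.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into folds with near-equal class counts."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)  # deal indices round-robin
    return folds


def cross_validate(labels, train_and_predict, n_folds=10, seed=0):
    """Average accuracy over folds, each fold used once as validation data."""
    folds = stratified_folds(labels, n_folds, seed)
    accuracies = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        predictions = train_and_predict(train_idx, test_idx)
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

Sensitivity and specificity could be averaged in the same way by counting true positives and true negatives per fold.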

Results 

Model construction with IFS 

As described in the Methods section, 128 feature sets were built, and the corresponding prediction models
were then constructed and evaluated. As shown in Figure 1, the IFS curve reached its inflection point at an
accuracy of 91.03% when 25 feature components were used. In other words, these 25 feature components
selected by mRMR compose the critical feature set for the allergen/non-allergen classifier. We analyze the
25 feature components in the next section to understand the key factors for a protein's allergenicity.

Figure 1 IFS curves of all proteins in the training dataset. IFS curves of the 128-D feature space. The
overall accuracy reached its inflection point of 91.03% when the number of feature components used was 25.

Optimization of feature components 

To investigate which features are crucial for a protein's allergenicity, we extracted the 25 feature
components at the inflection point from the mRMR list, in which two of five property types, "subcellular
locations" and "amino acid composition", were significantly enriched by the hypergeometric test
(p-value < 0.05, Benjamini-Hochberg correction) (Table 2). The heatmap in Figure 2 also illustrates that the
AAC and SL (subcellular location) features were remarkable [50]. We further tried to determine which of the
22 subcellular locations are of particular importance in allergen prediction by examining the SL
distribution in soybean (Glycine max) and wheat (Triticum aestivum), the two species with the most known
allergenic proteins so far. The results revealed that the endoplasmic reticulum (for soybean only) and two
other SLs, extracellular/cell surface and vacuole (for both soybean and wheat), were significantly more
enriched in allergens than in randomly selected proteins (p-value < 0.05) (Table 3 and Additional file 2).

Table 2 The optimal feature components

           SL       AAC      Hyd      Pola     NWV
PN(a)      22       20       21       21       21
SN(b)      9        8        3        3        1
p-value    0.0365   0.0345   0.2052   0.2052   0.06983

(a) PN means the number of successes in the population; (b) SN means the number of successes in the sample.

Figure 2 The heatmap of feature compositions of different property types. The vertical axis represents the
feature composition of eight types of protein properties, i.e. SL (subcellular locations), AAC (amino acid
composition), Pola (polarity), Hydr (hydrophobicity), Len (length), NWV (normalized van der Waals volume),
MW (molecular weight) and Polz (polarizability), while the horizontal axis shows the number of top features
selected in the IFS procedure, labelled Tx, in which x denotes the number of features. Warmer colours denote
higher correlation. The properties marked with a star (SL* and AAC*) performed remarkably.
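A hypergeometric enrichment test with Benjamini-Hochberg correction, of the kind referred to in this section, can be sketched as follows. This is a generic illustration and is not claimed to reproduce the tabulated p-values: the choice of an upper-tail test and of the population size (for example all 128 features, with the 25 selected features as the sample) are assumptions.

```python
from math import comb


def hypergeom_upper_tail(population, successes_in_population,
                         sample, successes_in_sample):
    """P(X >= successes_in_sample) when drawing `sample` items without
    replacement from `population` items containing the given successes."""
    upper = min(sample, successes_in_population)
    favourable = sum(
        comb(successes_in_population, k)
        * comb(population - successes_in_population, sample - k)
        for k in range(successes_in_sample, upper + 1)
    )
    return favourable / comb(population, sample)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under the stated assumptions, the enrichment of SL features would be computed as `hypergeom_upper_tail(128, 22, 25, 9)`, and the raw p-values of all property types would then be passed together to `benjamini_hochberg`.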



Table 3 The subcellular location analysis

Corrected p-value    End-ret(a)   Extr-sur(b)   Vacuole
Glycine max          0.0003       0.0210        3.8E-9
Triticum aestivum    -            0.0036        0.0314

(a) End-ret means endoplasmic reticulum; (b) Extr-sur means extracellular + cell surface.



Allergen predicting by category 

Since those concerned about allergenicity usually focus on a specific species or category, such as food
plants, rather than on all species, we performed a multiple sequence alignment and constructed a
phylogenetic tree using MEGA software (version 5.0) [37] for 116 allergens with sequence lengths between 240
and 600, taken from the two largest sub-families in each of six major categories (Aero-Fungi, Animal, Apple,
Food-Plant, Mite and Pollen). These six major categories included 909 allergens, accounting for over 77% of
all allergens. The NJ (Neighbour-Joining) tree (Figure 3, Additional file 3) illustrated that allergen
sequences were more conserved within categories than between categories. Hence, we built and evaluated our
predictor within Aero-Fungi, Animal, Apple, Food-Plant, Mite and Pollen individually. As displayed in
Figure 4, the category-specific models for Pollen and Apple outperformed the full model; the accuracy of
allergen prediction in Apple even reached 100%.

Figure 3 The NJ tree of 35 allergen sequences from five categories. The topology of this tree was generated
using MEGA 5, summarizing the evolutionary relationships among the allergens from different categories. The
branches of the same category were colour-coded.

Figure 4 Performance comparison in Pollen, Apple and all known allergens. The chart illustrates the
performance comparison of predictors based on 128-D feature vector models within Pollen and Apple against
within all known allergens.

Comparison with existing methods 

We compared the performance of our method with existing approaches for allergen prediction. So far there are
three major kinds of computational methods for allergen prediction: the FAO/WHO criteria, the motif-based
method and SVM-based methods. Among the SVM-based methods, SVM-AAC, which takes the amino acid composition
as the feature vector, is the most commonly used. The ROC curves illustrate the superiority of our 128-D
feature vector model over the others, with the overall accuracy reaching its peak of 93.42% (Figure 5).

Web-based application 

A web server named PREAL (http://gmobl.sjtu.edu.cn/PREAL/index.php) has been developed that allows users to
evaluate the potential allergenicity of proteins online using our new method. When a query protein sequence
in FASTA format is given, PREAL reports its putative allergenicity. Both the category-specific and full
models are available in PREAL. PREAL also provides batch prediction, which returns the results by e-mail. A
snapshot of the prediction page of PREAL is displayed in Figure 6.






Figure 5 The ROC curves of various approaches for allergen prediction. Curves are shown for sequence-based
rule 1, sequence-based rule 2, the motif-based method, SVM-AAC and our method; the horizontal axis is
1 - specificity.



Discussion and conclusions 

The aim of this study is to predict the potential allergenicity of proteins efficiently and to analyze the
key factors that result in allergenicity. We developed a new SVM-based model by integrating various
biochemical and physicochemical properties, as well as sequential features and subcellular locations.
Ten-fold cross-validation indicated that the predictor can achieve an overall accuracy of 93.42% to 100%.
Considering that secondary structure propensity and solvent accessibility contribute to a protein's
stability and function, we also expanded our model by adding these two kinds of properties. As predicted by
SSpro [31], an amino acid can be grouped as helix, strand or coil for the secondary structure propensity
(SSP), and as predicted by ACCpro [32], its solvent accessibility can be classified as buried or exposed
(Table 1). The expanded model can then be formulated as a vector in a 156-D (dimensional) space. However,
the corresponding evaluation indicated that the 156-feature model increased the overall accuracy by only
0.01, while its running time was more than 60 times longer than that of the 128-D model.

Figure 6 A snapshot of the prediction page of the web application. The page accepts a single protein
sequence in FASTA format with a probability threshold (0.0-1.0) and also offers batch prediction, in which a
FASTA-format file containing multiple protein sequences is uploaded and the results are returned to the
e-mail address specified.

With the feature selection procedure based on the mRMR and IFS methods, we found that subcellular locations
and amino acid composition play crucial roles in determining the allergenicity of a protein. For soybean and
wheat, the extracellular/cell surface and vacuole were observed to be the most relevant locations. Such key
factors for allergenicity have not been reported before. Because allergenic proteins have higher sequence
similarities within categories, we also applied the predictor to six major subsets, in which higher accuracy
was obtained. To facilitate application, we built a web-based application providing the prediction approach
presented in this paper online, so that users can conveniently run a single test or even large-scale
testing.

Nevertheless, some issues should be addressed in further study. Although allergen prediction within
categories performed well, the small number of allergenic proteins available in some categories limits its
wide usage. Another issue is the difficulty of effectively validating a new method by wet-lab experiments
rather than by cross-validation alone.

Additional material 



Additional file 1: The 128 features for allergen protein identification.

Additional file 2: The statistical data of subcellular locations for soybean and wheat. There are 22
subcellular locations (SL) for eukaryotic proteins. Only SL terms located by 3 or more allergens were
calculated.

Additional file 3: The NJ tree of 116 allergen sequences from six categories. The topology of this tree was
generated using MEGA 5, summarizing the evolutionary relationships among the allergens from different
categories. The branches of the same category were colour-coded. The NJ tree consisted of 116 allergen
proteins with sequence lengths between 240 and 600 whose protein families accounted for a higher proportion
within their categories.